ABSTRACT

Efficient techniques for inducing rules used in classifying
data items on a noisy data set. The prior-art IREP technique,
which produces a set of classification rules by inducing each
rule and then pruning it and continuing thus until a stopping
condition is reached, is improved with a new rule-value
metric for stopping pruning and with a stopping condition
which depends on the description length of the rule set. The
rule set which results from the improved IREP technique is
then optimized by pruning rules from the set to minimize the
description length and further optimized by making a
replacement rule and a modified rule for each rule and using
the description length to determine whether to use the
replacement rule, the modified rule, or the original rule in the
rule set. Further improvement is achieved by inducing rules
for data items not covered by the original set and then
pruning these rules. Still further improvement is gained by
repeating the steps of inducing rules for data items not
covered, pruning the rules, optimizing the rules, and again
pruning for a fixed number of times. The fully-developed
technique has the O(n log² n) running time characteristic of
IREP, but produces rule sets which do a substantially better
job of classification than those produced by IREP.

18 Claims, 9 Drawing Sheets 
FIG. 7

ripper(data)                                            // 701
// data is a set of examples;
// returns a ruleset
{
    /* this is IREP* */                                 // 403
    hyp = empty_rule_set;
    hyp = add_rules(data,hyp);                          // 703
    hyp = reduce_dlen(hyp,data);                        // 705

    /* this is RIPPER, being iterated k times */
    for i=1, ..., k {
        hyp = optimize_rules(data,hyp);                 // 707
        hyp = add_rules(data,hyp);
        hyp = reduce_dlen(hyp,data);
    }
    return hyp;
}
/* optimize the ruleset hyp, and possibly add some more rules
 * a special case is when hyp is empty: then this builds a ruleset
 */
optimize_rules(data,hyp)                                // 709
// data is a set of examples;
// hyp is the ruleset that was computed the last time around
{
    /* optimize existing rules */
    for rule_num = 1, ..., (number of rules in hyp) {   // 712
        /* split the data into growing and pruning sets */
        partition(data,grow_data,prune_data);

        /* save the old rule */
        old_rule = hyp[rule_num];                       // 710

        /* build a new rule */                          // 713
        new_rule = new rule with empty body asserting class to be "+";
        new_rule = refine(new_rule,grow_data);
        new_rule = simplify(new_rule,hyp,rule_num,prune_data);

        /* build a revised rule */                      // 715
        revised_rule = hyp[rule_num];
        revised_rule = refine(revised_rule,grow_data);
        revised_rule = simplify(revised_rule,hyp,rule_num,prune_data);

        /* pick one of the old, new or revised rules */ // 717
        new_val = relative_compression(new_rule,hyp,rule_num,data);
        rev_val = relative_compression(revised_rule,hyp,rule_num,data);
        old_val = relative_compression(old_rule,hyp,rule_num,data);
        if (old_val >= new_val and old_val >= rev_val) {
            chosen_rule = old_rule;
        } else if (rev_val >= new_val) {
            chosen_rule = revised_rule;
        } else {
            chosen_rule = new_rule;
        }

        remove examples covered by chosen_rule from data;
        hyp[rule_num] = chosen_rule;                    // 721
    }
}
FIG. 8

add_rules(data,hyp)                                     // 801
// data is a set of examples;
// hyp is a ruleset;
{
    remove examples covered by any rules in hyp from data;   // 803

    /* add new rules for uncovered examples */
    while ((there are positive examples in data)
           and last_rule_accepted) {                    // 804
        /* split the data into growing and pruning sets */
        partition(data,grow_data,prune_data);           // 805

        /* build a new rule */                          // 806
        new_rule = new rule with empty body asserting class to be "+";
        new_rule = refine(new_rule,grow_data);          // 807
        new_rule = simplify(new_rule,hyp,rule_num,prune_data);   // 809

        /* decide if you should keep the new rule */
        if (reject_rule(new_rule,data)) {               // 811
            last_rule_accepted = FALSE;
        } else {
            remove examples covered by new_rule from data;   // 813
            append the new_rule to hyp;                 // 815
        }
    }
    return hyp;
}
FIG. 9

/* decide if a new rule should be added to the ruleset */
reject_rule(rule,data,hyp,rule_num)                     // 901
{
    if (total compression of hyp with rule and posn rule_num >=
        best compression seen to date + MAX_DECOMPRESSION) {   // 911
        return TRUE;
    } else if (error rate of rule on data > 50%) {      // 913
        return TRUE;
    } else {
        return FALSE;
    }
}

/* grow the rule */
refine(rule,data)
{
    last_refinement_rejected = FALSE;
    while (negative examples in data are covered by rule) {   // 904
        refinement = refinement ref of rule with max ref_value(rule,ref,data);
        if (reject_refinement(refinement,rule,data)) {
            last_refinement_rejected = TRUE;
        } else {
            rule = refinement;
            remove from data examples not covered by refinement;
        }
    }
    return rule;
}

/* value function used in refining a rule */
ref_value(old_rule,refined_rule,data)                   // 905
{
    p1 = number of positive examples in data covered by old_rule;
    n1 = number of negative examples in data covered by old_rule;
    p2 = number of positive examples in data covered by refined_rule;
    n2 = number of negative examples in data covered by refined_rule;
    /* return "information gain" */
    return p2*[log2((p1+n1)/p1) - log2((p2+n2)/p2)];    // 907
}

/* generalize a rule */
simplify(rule,hyp,rule_num,data)
{
    while (body of rule is not empty) {                 // 910
        gen = generalization of rule with best gen_value(rule,hyp,rule_num,data);
        if (value of gen <= value of rule) {
            break;
        } else {
            rule = gen;
        }
    }
    return rule;
}
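
The information-gain value computed by ref_value can be sketched in Python as follows. This is only an illustrative sketch of the formula shown at 907; the guard for zero positive counts is an assumption added here so that the logarithms stay defined, and is not taken from the figure.

```python
import math


def ref_value(p1: int, n1: int, p2: int, n2: int) -> float:
    """Information gain of a refinement, per the formula at 907.

    p1, n1: positive/negative examples covered by the old rule.
    p2, n2: positive/negative examples covered by the refined rule.
    """
    # Assumption (not in the figure): treat the gain as 0 when either rule
    # covers no positive examples, since the logarithm is undefined there.
    if p1 == 0 or p2 == 0:
        return 0.0
    return p2 * (math.log2((p1 + n1) / p1) - math.log2((p2 + n2) / p2))


# An old rule covering 10 positives and 10 negatives is refined so that it
# covers 8 positives and no negatives: each retained positive gains one bit.
gain = ref_value(10, 10, 8, 0)
```

A refinement that keeps the same ratio of positives to negatives yields a gain of zero, so the loop at 904 prefers refinements that concentrate the positive examples.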
FIG. 10

/* value function used in generalization */
gen_value(rule,hyp,rule_num,data)                       // 1001
{
    if (rule_num < #rules in hyp) {
        /* optimizing a rule: use accuracy of rule in context */
        hyp1 = copy of hyp;
        hyp1[rule_num] = rule;
        e = number of errors made by hyp1 on data;
        tot = number of examples in data;
        return 1 - e/tot;
    } else {                                            // 1005
        /* use heuristic function from paper */
        p = number of positive examples in data covered by rule;
        n = number of negative examples in data covered by rule;
        return (p-n)/(p+n);                             // 1007
    }
}

/* reduction in description length obtained by inserting rule
   in hyp at position rule_num, relative to deleting that rule
   from the hypothesis
 */
relative_compression(rule,hyp,rule_num,data)            // 1009
{
    null_rule = new rule with body "false" asserting class to be "+";

    hyp_with = copy of hyp;                             // 1011
    hyp_with[rule_num] = rule;
    hyp_with = reduce_dlen(hyp_with,data);

    hyp_without = copy of hyp;                          // 1013
    hyp_without[rule_num] = null_rule;
    hyp_without = reduce_dlen(hyp_without,data);

    dlen_with = data_dlen(hyp_with,data);               // 1015
    dlen_without = data_dlen(hyp_without,data);

    return dlen_without - dlen_with+rule_dlen(rule);    // 1017
}
FIG. 11

/* description length of data given a hypothesis, i.e. number of bits
 * needed to encode the exceptions to the predictions made by hyp */
data_dlen(hyp,data)
{
    apply rules in hyp to data and compute these statistics:
        fp = #false positives;
        fn = #false negatives;
        cov = #examples covered;
        uncov = #examples not covered;

    /* return #bits to encode the exceptions (fp,fn)
       using the method of (Quinlan,ML95)
     */
    define subset_dlen(n,e,p) = -Log2(p)*e + -Log2(1-p)*(n-e)

    e = fn+fp;
    if (cov >= uncov) {
        return
            Log2(cov+uncov+1)
            +subset_dlen(cov,fp,0.5*e/cov)
            +subset_dlen(uncov,fn,fn/uncov);
    } else {
        return
            Log2(cov+uncov+1)
            +subset_dlen(uncov,fn,0.5*e/uncov)
            +subset_dlen(cov,fp,fn/uncov);
    }
}

/* reduce description length of hypothesis by deleting bad rules */
reduce_dlen(hyp,data)                                   // 1109
{
    n = #rules in hyp;
    for i=n, ..., 1 do                                  // 1111
        hyp1 = copy of hyp with rule i deleted;
        if (total_dlen(hyp1,data) < total_dlen(hyp,data)) {   // 1113
            hyp = hyp1;
        }
    endfor
    return hyp;
}

/* total description length of hypothesis and data */
total_dlen(hyp,data)
{
    n = #rules in hyp;                                  // 1115
    tot = data_dlen(hyp,data);                          // 1117
    for i=n, ..., 1 do tot += rule_dlen(hyp[i]);        // 1119
    return tot;
}
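
The subset-coding cost and the backward deletion loop of reduce_dlen can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `total_dlen` is passed in as a parameter, and `toy_total_dlen`, the rule tuples, and the 10-bit penalty are hypothetical stand-ins invented for illustration; they are not the encoding of the preferred embodiment.

```python
import math


def subset_dlen(n: int, e: int, p: float) -> float:
    """Bits to identify e of n elements when each is flagged with probability p."""
    bits = 0.0
    if e > 0:
        bits += -math.log2(p) * e
    if n - e > 0:
        bits += -math.log2(1.0 - p) * (n - e)
    return bits


def reduce_dlen(hyp, data, total_dlen):
    """Visit rules from last to first; delete a rule when that shortens the total."""
    hyp = list(hyp)
    for i in range(len(hyp) - 1, -1, -1):
        candidate = hyp[:i] + hyp[i + 1:]
        if total_dlen(candidate, data) < total_dlen(hyp, data):
            hyp = candidate
    return hyp


def toy_total_dlen(hyp, data):
    """Hypothetical cost: rule bits plus 10 bits per example left uncovered."""
    covered = set().union(*(rule[2] for rule in hyp))
    return sum(rule[1] for rule in hyp) + 10 * len(set(data) - covered)


# Each toy rule is (name, cost in bits, set of examples it covers).
rules = [("A", 3, {1, 2}), ("B", 3, {3}), ("C", 3, {2})]
pruned = reduce_dlen(rules, {1, 2, 3}, toy_total_dlen)
```

In this toy run rule C is deleted because rule A already covers its only example, so removing C saves its bits without adding exceptions.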
RULE INDUCTION ON LARGE NOISY DATA
SETS

BACKGROUND OF THE INVENTION

1. Field of the Invention
The invention relates generally to machine learning techniques
and more particularly to techniques for inducing
classification rules which will efficiently classify large, noisy
sets of data.

2. Description of the Prior Art
Machine Classification: FIG. 1

One of the most common human activities is classification.
Given a set of objects, we classify the objects into
subsets according to attributes of the objects. For example,
if the objects are bills, we may classify them according to the
attribute of payment date, with overdue bills making up one
subset, due bills another, and not yet due bills the third.

Classification has always been expensive, and has accordingly
always been mechanized to the extent permitted by
technology. When the digital computer was developed, it
was immediately applied to the task of classification. FIG. 1
shows a prior-art classification system 101 which has been
implemented using a digital processor 105 and a memory
system 103 for storing digital data. Memory 103 contains
unclassified data 107 and classifier 111.

Unclassified data 107 is a set of data items 108. Each data
item 108 includes attribute values 118(0..n) for a number of
attributes 117(0..n). In the bill example, the attributes 117(i)
of data items 108 representing bills would include the bill's
due date and its past due date and the attribute values 118(i)
for a given data item 108 would include the due date and the
past due date for the bill represented by that data item.
Classifier 111 includes a classifier program 115. Operation
of system 101 is as follows: processor 105 executes classifier
program 115, which reads each data item 108 from
unclassified data 107 into processor 105, classifies the data
item 108, and places it in classified data 109 according to its
class 110. In the bill example, there would be three classes
110, not yet due, due, and overdue.

While it is possible to build a classifier program 115 in
which the classification logic is built into the program, it is
common practice to separate classification logic 113 from
the program, so that all that is necessary to use the program
to classify different kinds of items is to change classification
logic 113. One common kind of classification logic 113 is a
set of rules 119. Each rule consists of a sequence of logical
expressions 121 and a class specifier 123. Each logical
expression 121 has an attribute 125 of the data items being
classified, a logical operator such as =, <, >, ≤, or ≥, and a
value 131 with which the value of attribute 125 is to be
compared. Continuing with the bill example, classifier logic
113 for the bills would be made up of three rules:
    past_due_date<curr_date->overdue
    due_date=<curr_date AND
        past_due_date>=curr_date->due
    due_date>curr_date->not yet due

The expression to the right of the -> symbol is the class
to which the rule assigns a data item; the expression to the
left is the sequence of logical expressions. To classify a data
item, classifier program 115 applies rules 119 to the data
item until one is found for which all of the logical expressions
are true. The class specified for that rule is the class of
the data item. Thus, if the due date for a bill is June 1, the
past due date June 15, and the current date June 8, executing
program 115 with the above set of rules will result in the
application of the second rule above to the data item for the
bill, and that will in turn classify the bill as being "due".
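
The way an ordered rule set is applied to a data item can be sketched in Python as follows. This is only an illustrative sketch of the bill example: the names `Rule`, `classify`, and `bill_rules` are hypothetical, and the year 1995 is an arbitrary assumption, since the example gives only the day and month.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List, Optional

Item = Dict[str, date]


@dataclass
class Rule:
    conditions: List[Callable[[Item], bool]]  # the sequence of logical expressions
    label: str                                # the class specifier


def classify(item: Item, rules: List[Rule]) -> Optional[str]:
    """Return the class of the first rule whose expressions are all true."""
    for rule in rules:
        if all(cond(item) for cond in rule.conditions):
            return rule.label
    return None  # no rule applied


def bill_rules(curr_date: date) -> List[Rule]:
    """The three bill rules, evaluated against a given current date."""
    return [
        Rule([lambda b: b["past_due_date"] < curr_date], "overdue"),
        Rule([lambda b: b["due_date"] <= curr_date,
              lambda b: b["past_due_date"] >= curr_date], "due"),
        Rule([lambda b: b["due_date"] > curr_date], "not yet due"),
    ]


# Due June 1, past due June 15, current date June 8 (year assumed).
bill = {"due_date": date(1995, 6, 1), "past_due_date": date(1995, 6, 15)}
label = classify(bill, bill_rules(date(1995, 6, 8)))
```

Because the rules are tried in order, changing the classification logic only requires supplying a different rule list; the `classify` function itself stays the same.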



Inducing Classifier Logic 113: FIG. 2

Building classifier logic 113 for something as simple as
the bill classification system is easily done by hand;
however, as classification systems grow in complexity, it
becomes necessary to automate the construction of classifier
logic 113. The art has consequently developed systems for
inducing a set of rules from a set of data items which have
been labeled with their classifications.

FIG. 2 shows such a system 201, again implemented in a
processor and memory. System 201 includes classified data
201 and induction program 205. Classified data 201 is
simply a set of data items 108 in which each data item 108
has been classified. As shown at 203, each classified data
item 203 includes values for a number of attributes and a
class specifier 123 for the class to which the data item
belongs. Classifier logic 113 is produced by executing
induction program 205 on classified data 201.

There are two techniques known in the art for inducing 
classifier logic 113. In the first technique, induction program 
205 begins by building classifier logic 113 that at first 
contains much more logic than is optimum for correctly 
classifying the data items and then prunes classifier logic 
113 to reduce its size. In the second technique, classifier 
logic 113 is built piece by piece, with construction stopping 
when classifier logic 113 has reached the right size. 

The first technique, in which classifier logic 113 is first 
made much larger than necessary and then pruned, is exem- 
plified by the C4.5 system, described in J. Ross Quinlan, 
C4.5: Programs for Machine Learning, Morgan Kaufmann,
San Mateo, Calif., 1993. In this system, induction program 
205 produces a decision tree from classified data 201 which 
correctly classifies the data and then prunes the decision tree. 
One version of C4.5, called C4.5RULES, converts the 
unpruned decision tree to a set of rules by traversing the 
decision tree from the root to each leaf in turn. The result of 
each traversal to a leaf is a rule. The set of rules is then 
pruned to produce a smaller set which will also correctly 
classify the data. 

The drawback of this technique is that it does not work
well with example sets that are large and noisy. In the
machine learning context, a noisy data set is one which does
not permit generation of a set of rules in which a classification
produced by a given rule is exactly correct but rather
only permits generation of a set of rules in which the
classification produced by a given rule is probably correct.
As the size and/or the noisiness of the example data set
increase, the technique becomes expensive in terms of both
computation time and memory space. With regard to time,
the technique's time requirements asymptotically approach
O(n⁴), where n is the number of classified data items 203 in
classified data 201. With regard to space, the technique
requires that the entire decision tree be constructed in
memory and in the case of the rule version, that there be
storage space for all of the rules produced from the decision
tree. Some improvement of the foregoing is possible with
problems where there are only two classes of data items, but
even the improved technique requires O(n³) time and O(n²)
space.

The second technique is much less expensive in terms of
time and space. This technique, called Incremental Reduced
Error Pruning, or IREP, is explained in detail in Johannes
Fürnkranz and Gerhard Widmer, "Incremental reduced error
pruning", in: Machine Learning: Proceedings of the Eleventh
Annual Conference, Morgan Kaufmann, New
Brunswick, N.J., 1994. IREP builds up classifier logic 113 as
a set of rules, one rule at a time. After a rule is found, all
examples covered by the rule (both positive and negative)
are deleted from classified data 201. This process is repeated
until there are no positive examples, or until the last rule
found by IREP has an unacceptably large error rate.

In order to build a rule, IREP uses the following strategy.
First the examples from classified data 201 which are not
covered by any rule are randomly partitioned into two
subsets, a growing set and a pruning set.

Next, a rule is "grown" using a technique such as FOIL,
described in detail in J. R. Quinlan and R. M. Cameron-Jones,
"FOIL: a Midterm Report", in: Pavel B. Brazdil, ed.,
Machine Learning: ECML-93 (Lecture Notes in Computer
Science #667), Springer-Verlag, Vienna, Austria, 1993.
FOIL begins with an empty conjunction of conditions and
considers adding to it any condition of the form
A_n=v, A_c≤θ, or A_c≥θ, where A_n is a nominal attribute and v is a
legal value for A_n, or A_c is a continuous variable and θ is
some value for A_c that occurs in the training data. A
condition is selected to be added when adding the condition
maximizes FOIL's information gain criterion. Conditions
are added until the rule covers no negative examples from
the growing dataset.

Once grown, the rule is immediately pruned. Pruning is
implemented by deleting a single final condition of the rule
and choosing the deletion that maximizes the function

$$v(\mathit{Rule}, \mathit{PrunePos}, \mathit{PruneNeg}) = \frac{p + (N - n)}{P + N} \qquad (1)$$

where P (respectively N) is the total number of examples in
PrunePos (PruneNeg) and p (n) is the number of examples
in PrunePos (PruneNeg) covered by Rule. This process is
repeated until no deletion improves the value of v. Rules
thus grown and pruned are added to the rule set until the
accuracy of the last rule added is less than the accuracy of
the empty rule.

IREP does indeed overcome the time and space problems
of the first technique. IREP has a running time of
O(n log² n) and because it grows its rule set, also has far
smaller space requirements than the first technique. Experiments
with IREP and C4.5RULES suggest that it would take
about 79 CPU years for C4.5RULES to produce a rule set
from an example data set of [...] data items, while
IREP can produce a rule set from that data set in [...].
IREP is thus fast enough to be used in many
interactive applications, while C4.5RULES is not. There are
however two problems with IREP. The first is that rule sets
made using the first technique make substantially fewer
classification errors than those made using IREP. The second
is that IREP fails to converge on some data sets, that is,
exposing IREP to more classified examples from these data
sets does not reduce the error rate of the rules.

It is an object of the invention to provide a technique for
inducing a set of rules which has time and space requirements
on the order of those for IREP, but which converges
and produces sets of rules which classify as well as those
produced by the first technique.

SUMMARY OF THE INVENTION

The foregoing and other problems of the art are solved by
making a rule set which is substantially smaller than the
largest rule set that can be made by the method being used
and then producing a final rule set by optimizing the original
rule set with regard to the rule set as a whole. Making a small
rule set gives the time and space advantages of the IREP
approach while optimization with regard to the rule set as
a whole substantially improves the quality of the classification
produced by the rule set.

A particularly advantageous way of optimizing with
regard to the rule set as a whole is [...]
the description length of the rule set. The invention features
two types of such optimization. In one type, rules are pruned
from the rule set to reduce the description length. In another
type, rules in the rule set are modified to reduce the
description length. In a preferred embodiment, the rule set is
first pruned and the pruned rule set is then modified.

Additional improvement is achieved by iterating with any
example data items which are not covered by the rules in the
optimized rule set. New rules are generated for those data
items as described above and added to the rule set produced
by the first iteration. The new rule set is then optimized
[...] rule set.

In other aspects of the invention, the set of rules is
produced by inducing the rules one by one and pruning each
rule as it is produced. Production of rules continues until a
stopping condition is satisfied. The invention further provides
better techniques for pruning individual rules and a
better rule value metric for determining when to stop pruning.
Also provided is a stopping condition for the rule
set which is based on the description length of the rule set
with the new rule relative to the smallest description length
obtained for any of the rule sets thus far. Finally, IREP has
been improved to support missing attributes, numerical
variables, and multiple classes.

Other objects and advantages of the apparatus and methods
disclosed herein will be apparent to those of ordinary
skill in the art upon perusal of the following Drawing and
Detailed Description, wherein:

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a block diagram of a prior-art classifier;
FIG. 2 is a block diagram of a prior-art system for
inducing classifier logic;
FIG. 3 is a diagram of modules in an induction program;
FIG. 4 is a flowchart of a first rule induction method;
FIG. 5 is a flowchart of a second rule induction method;
FIG. 6 is a flowchart of a third rule induction method;
FIG. 7 is pseudo-code for a preferred embodiment of a
first portion of the method of FIG. 6;
FIG. 8 is pseudo-code for a preferred embodiment of a
second portion of the method of FIG. 6;
FIG. 9 is pseudo-code for a preferred embodiment of a
third portion of the method of FIG. 6;
FIG. 10 is pseudo-code for a preferred embodiment of a
fourth portion of the method of FIG. 6; and
FIG. 11 is pseudo-code for a preferred embodiment of a
fifth portion of the method of FIG. 6.

Reference numbers in the Drawing have two parts: the
two least-significant digits are the number of an item in a
figure; the remaining digits are the number of the figure in
which the item first appears. Thus, an item with the reference
number 201 first appears in FIG. 2.

DETAILED DESCRIPTION OF A PREFERRED
EMBODIMENT

In the following, the new technique for inducing a set of
rules is described in three stages: first, an improved version
of IREP called IREP* is presented; then a technique for
optimizing the rule set produced by IREP* is set forth; next,
a method which combines IREP* and the optimization is
described. This method is termed RIPPER (for Repeated
Incremental Pruning to Produce Error Reduction). Finally,
an iterative version of RIPPER called RIPPERk is presented.
Thereupon, details are provided of the preferred
embodiment's implementation of salient portions of IREP*
[...].

IREP*: FIG. 4

[...] The
first part of IREP* 403 is loop 414 which builds the set of
rules rule-for-rule. At step 407, a rule 119(i) is grown in the
fashion described above for IREP. The next step, 409, is to
prune rule 119(i). In contrast to IREP, any final sequence of
conditions in rule 119(i) is considered for pruning and that
sequence is retained which maximizes a rule-value metric
function

$$v^{*}(\mathit{Rule}, \mathit{PrunePos}, \mathit{PruneNeg}) = \frac{p - n}{p + n}$$

The above function is for rules that classify the data items
into two classes. p represents the number of positive data
items, that is, those that the rule successfully classifies as
members of the rule's class. n represents the number of
negative data items, that is, those that the rule successfully
classifies as not being members of the rule's class. After rule
119(i) has been grown and pruned, it is added to rule set 120
(411).

Decision block 413 determines whether the stopping
condition for rule set 120 has been met. If it has not, loop
414 is repeated. Proper choice of the stopping condition
ensures that rule set 120 is large enough to properly classify
the data but small enough to avoid the time and space
problems of techniques such as those used in the C4.5
system. In the preferred embodiment, the stopping condition
is determined as follows using the Minimum Description
Length Principle. As set forth at Quinlan, C4.5: Programs
for Machine Learning, supra, p. 51f., the principle states that
the best set of rules derivable from the training data will
minimize the number of bits required to encode a message
consisting of the set of rules together with those data
items which are not correctly classified by the rules and are
therefore exceptions to them. The length of this message for
a given set of rules is the description length of the rule set,
and the best rule set is the one with the minimum description
length.

In IREP* 403, the description length is used like this to
determine whether the rule set is large enough: After each
rule is added, the description length for the new rule set is
computed. IREP* 403 stops adding rules when this description
length is more than d bits larger than the smallest
description length obtained for any rule set so far, or when
there are no more positive examples. In the preferred
embodiment, d=64.

In the preferred embodiment, the scheme used to encode
the description length of a rule set and its exceptions is
described in J. Ross Quinlan, "MDL and categorical theories
(continued)" in: Machine Learning: Proceedings of the
Twelfth International Conference, Lake Tahoe, Calif., 1995,
Morgan Kaufmann. One part of this encoding scheme can be
used to determine the number of bits needed to send a rule
with k conditions. The part of interest allows one to identify
a subset of k elements of a known set of n elements using

$$S(n, k, p) = k \log_2 \frac{1}{p} + (n - k) \log_2 \frac{1}{1 - p}$$

bits, where p is known by the recipient of the message. Thus
we allow ||k|| + S(n,k,k/n) bits to send a rule with k conditions,
where n is the number of possible conditions that could
appear in a rule and ||k|| is the number of bits needed to send
the integer k. The estimated number of bits required to send
the theory is then multiplied by 0.5 to adjust for possible
redundancy in the attributes.

The number of bits needed to send exceptions is determined
as follows, where T is the number of exceptions, C is
the number of examples covered, U is the number of
examples not covered, e is the number of errors, fp is the
number of false positive errors, and fn is the number of false
negative errors. The number of bits to send exceptions is

    if (C > T/2) then [...]

After the stopping condition has been met, the rule set is
pruned in step 415. The pruning is done in a preferred
embodiment by examining each rule in turn (starting with
the last rule added), computing the description length of the
rule set with and without the rule, and deleting any rule
whose absence reduces the description length.

Together, the rule-value metric used in pruning step 409
and the stopping metric used in stopping condition 413 of
IREP* 403 substantially improve IREP's performance.
IREP* 403 converges on data sets upon which IREP fails to
converge and the rule sets produced using IREP* 403 do
substantially better at making correct classifications than
those produced using IREP. In tests on a suite of data sets
used for determining the performance of systems for inducing
rules, sets of rules produced by IREP* 403 had 6% more
classification errors than sets of rules produced by
C4.5RULES, while sets of rules produced by IREP had 13%
more errors.

IREP* improves on other aspects of IREP as well. As
originally implemented, IREP did not support missing
attribute values in a data item, attributes with numerical
values, or multiple classes. Missing attribute values are
handled like this: all tests involving the attribute A are
defined to fail on instances for which the value of A is
missing. This encourages IREP* to separate out the positive
examples using tests that are known to succeed.

IREP* or any method which induces rules that can
distinguish two classes can be extended to handle multiple
classes in this fashion: First, the classes are ordered. In the
preferred embodiment the ordering is always in increasing
order of prevalence, i.e., the ordering is C_1, ..., C_k where
C_1 is the least prevalent class and C_k is the most prevalent.
Then, the two-class rule induction method is used to find a
rule set that separates C_1 from the remaining classes; this is
done by splitting the example data into a class of positive
data which includes only examples labeled C_1 and a class of
negative data which contains examples of all the other
classes and then calling the two-class rule induction method
to induce rules for C_1. When this is done, all data items
classified as belonging to C_1 by those rules are removed
from the data set. Next, all instances covered by the learned
rule set are removed from the dataset. The above process is
repeated with each of the remaining classes C_2, ..., C_k until
only C_k remains; this class will be used as the default class.

Optimization of the Rule Set: FIG. 4

A problem with IREP is that the effect of a given rule on
the quality of the set of rules as a whole is never considered.
IREP* 403 begins to deal with this problem with step 415 of
pruning the rule set, as described above. A further approach
to dealing with this problem is optimization step 417. The
[...] so as to minimize the error of the entire rule set.
In the preferred embodiment, [...] two
alternative rules are constructed. The replacement for R_i is
formed by growing and then pruning a rule R'_i, where
pruning is guided so as to minimize error of the entire rule
set [...]
should include the revised rule, the replacement rule, or the
original rule. This is done by inserting each of the variants
of R_i into the rule set and then deleting rules [...]
description length of the rules and examples. The description
length of the examples and the simplified rule set is then
used to compare variants of R_i and the variant is chosen
which produces the rule set with the shortest description
length.

RIPPER: FIG. 5

[...] decision block 503, the rule set [...]
data items to see if there are any data items which are not
[...] sets produced by RIPPER 501 now make only 1% more
classification errors than those produced by C4.5RULES.

RIPPERk: FIG. 6

Further performance improvements can be obtained by
placing loop 511 from RIPPER 501 in another loop which
[...] not covered by the rules, [...]
rules for those data items to the set of rules to produce an
augmented rule set, and then optimizing [...]
set using the techniques [...]
version of the technique called RIPPERk, where k is the
[...] RIPPER loop 511 is [...]
optimized in step 603 in the fashion [...] with
regard to optimization step 417 and thereupon pruned as
described with regard to pruning step 415. [...]
of the technique was run on the trial [...]
that produced by C4.5RULES and [...] retained [...]
O(n log² n) running time [...].

Details of a Preferred Embodiment: FIGS. 3, 7-11

The foregoing techniques are implemented in a preferred
embodiment by means of an improved induction program
301, shown in FIG. 3. Induction program 301 includes two
sets of components. Rule set making components 303 make
the rule set; a rule set optimizer 305 optimizes the rule set.
Rule set making components 303 include a rule growing
component 307, which grows individual rules, a rule pruning
component [...]
component 315 for making replacement rules, a component
[...] and a deciding component 319
[for] deciding whether [...].

FIGS. 7-11 [...] pseudo-code for an implementation of
[...] 303 and the rule set optimizer [...].
[...] with FIG. 7, ripper 701 is the top level function
[...] the preferred embodiment. It
[...] 201 as an argument and
[...]. The part of ripper 701 labeled
[...] labeled 601
[...] 701 invokes the
[...] loop 414 of
flowchart 401 and produces a first set of rules for the dataset
[...]. Then the function reduce_dlen 705
[...] the function optimize_rules 707 implements
[...] with the function add_rules
[...] ripper 701 returns the [...] rule [...].

[...] by ripper 701.
add_rules is shown at 801 in FIG. 8. The first step, 803, is
[...] by a rule that is already in
[...] new rules are added
[...] 804 until [...] condition occurs. To build
[...] partitioned into a set of
[...] and a set of data for testing it for
[pruning] purposes (805). Then the new rule is built (806).
[...] "empty rule" that has the class
[...] rule [...] have the class for which rules
[are] currently being made.
[...] This function is shown at [...] in FIG. 9. Loop
904 adds logical expressions one at a time until there are no
[...] each logical
[...] information gain is computed as
[...] value 905. When the stopping
[...] adding logical expressions is reached, the rule
[is] returned. [...] the logical expression is added to the
[...] examples no longer covered by the refined
[...] data set and the loop is repeated.

[...] function prunes the new rule
[...] 910 of the
simplify [...] for pruning,
[...] metric. If
[...] pruning is better than
[...] achieved, the [pruning] is retained;
[...] when a pruning is retained, the
[...] pruning are removed
[...]
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shown in detail in FIG. 10 at 1001. The part of gen__value 
which is of importance for the present discussion is at 1005, 
where the rule-value metric discussed supra is shown at 
1007. 
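
As an illustration only, the rule-value metric can be sketched in Python using the form given in the claims, (p - n) / (p + n), where p and n count the positive and negative examples a rule covers. The names rule_value and prune_rule, the representation of a rule as a list of predicates, and the treatment of a rule that covers nothing are assumptions made for this sketch; they are not taken from the pseudo-code of FIG. 10.

```python
def rule_value(p, n):
    """Rule-value metric (p - n) / (p + n) as stated in the claims."""
    total = p + n
    if total == 0:
        return 0.0  # assumption: a rule covering nothing gets a neutral score
    return (p - n) / total


def prune_rule(conditions, prune_set):
    """Keep the prefix of `conditions` with the best metric on the pruning data.

    conditions: list of predicates (example -> bool), all of which must hold.
    prune_set:  list of (example, is_positive) pairs held out for pruning.
    Hypothetical helper: a pruning is retained only if it is strictly better.
    """
    def score(conds):
        labels = [is_pos for example, is_pos in prune_set
                  if all(cond(example) for cond in conds)]
        p = sum(1 for is_pos in labels if is_pos)
        return rule_value(p, len(labels) - p)

    best = list(conditions)
    best_score = score(best)
    # Try each shorter prefix, always keeping at least one condition.
    for k in range(len(conditions) - 1, 0, -1):
        candidate = list(conditions[:k])
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```

A rule whose final condition admits as many negative as positive pruning examples scores 0 and is replaced by a shorter prefix that scores higher.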

At 811, the function reject_rule is invoked to check the
stopping condition. Pseudo-code for the function is at 901.
As shown at 901, the preferred embodiment has two stop- 
ping conditions. The first stopping condition to be checked 
(911) uses the description length and indicates that the 
stopping condition has occurred when the description length 
which results when the current rule is added to the rule set 
is larger than the shortest description length yet attained for 
the rule set by an amount which is greater than or equal to 
the constant amount MAX_DECOMPRESSION. If this 
stopping condition has not occurred, the function 901 checks 
at 913 whether the rule to be added has an error rate of more 
than 50%; again, if it does, the function indicates that the 
stopping condition has occurred. When the stopping condition
has occurred, the variable last_rule_accepted is set to
FALSE, which terminates loop 804. If the stopping condition
has not occurred, the examples covered by the new rule
are removed from the data (813) and the new rule is added 
to the rule set (815). 
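
The two checks described above can be sketched as follows. This is a minimal Python illustration, not the pseudo-code at 901: the numeric value of MAX_DECOMPRESSION, the argument names, and the definition of a rule's error rate as n / (p + n) over the examples it covers are all assumptions.

```python
MAX_DECOMPRESSION = 64  # assumed value; the text names the constant only


def reject_rule(dl_with_rule, best_dl, p, n):
    """Return True when the stopping condition has occurred.

    dl_with_rule: description length once the candidate rule is added.
    best_dl:      shortest description length attained so far.
    p, n:         positive and negative examples covered by the rule.
    """
    # First check (911): the description length exceeds the best one
    # seen so far by at least MAX_DECOMPRESSION.
    if dl_with_rule - best_dl >= MAX_DECOMPRESSION:
        return True
    # Second check (913): the rule has an error rate of more than 50%.
    covered = p + n
    if covered > 0 and n / covered > 0.5:
        return True
    return False
```

In a loop corresponding to 804, a True result would set last_rule_accepted to FALSE and end rule growth; a False result would let the rule be added and its covered examples removed.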
reduce_dlen

The reduce_dlen function (705) prunes the rule set produced
by add_rules. The function 705 is shown in detail at
1109 in FIG. 11. The function 1109 consists mostly of loop
1111, which, for each rule in turn, makes a copy of the
current rule set without the rule and then computes the
description lengths of the current rule set with and without
the rule. If the current rule set without the rule has the shorter
description length (1113), that rule set becomes the current
rule set. The description length is computed by the function
total_dlen, shown at 1115. total_dlen first uses the function
data_dlen to compute the description length of the data
items which are exceptions to the current rule set (1117) and
then makes the description length for the entire rule set. As
shown at 1119, that is done by starting with the description
length of the data items and then adding to it the description
length of each rule in turn. As for data_dlen, that function
is shown in detail at 1101. The function simply implements
the method described in the Quinlan 1995 reference discussed
supra.
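
The pruning loop and the total description length can be sketched in Python as below. The sketch takes data_dlen and rule_dlen as caller-supplied functions, because the encoding itself comes from the Quinlan 1995 reference and is not reproduced here; the parameter names and the list-based rule set are assumptions for illustration.

```python
def total_dlen(rules, data, data_dlen, rule_dlen):
    """Description length of the exceptions plus that of every rule."""
    total = data_dlen(rules, data)   # exceptions to the rule set (1117)
    for rule in rules:               # add each rule in turn (1119)
        total += rule_dlen(rule)
    return total


def reduce_dlen(rules, data, data_dlen, rule_dlen):
    """Drop each rule whose removal shortens the total description length."""
    current = list(rules)
    for rule in list(rules):         # each rule in turn (loop 1111)
        without = [r for r in current if r is not rule]
        shorter = (total_dlen(without, data, data_dlen, rule_dlen)
                   < total_dlen(current, data, data_dlen, rule_dlen))
        if shorter:                  # comparison at 1113
            current = without
    return current
```

Passing the two length functions in keeps the sketch independent of any particular coding scheme, so a toy cost model can stand in for the real one when experimenting.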
optimize_rules

This function 709 takes the rule set produced by IREP*
403 and optimizes it by making a new rule for each rule in
the rule set, making a modified rule for each rule in the rule
set, and then using the description lengths of the rule set with
the original rule, with the new rule, and with the modified
rule to select one of the three for inclusion in the optimized
rule set. The function 709 contains loop 712, which is
executed for each rule in the rule set. For each rule, the
function saves the old rule (710). It then makes a new rule
(713) in the same manner as explained for add_rules; next
it makes a modified rule (715) by adding logical expressions
to the old rule. Adding and pruning are again done as
explained for add_rules. Next, the rule that yields the rule
set with the shortest description length is chosen (717). Then
the examples covered by the chosen rule are removed from
the example data (721).
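
The per-rule choice among the original rule, the new rule, and the modified rule can be sketched as follows. This is a simplified Python illustration under stated assumptions: make_replacement, make_revision, ruleset_dlen, and covers are hypothetical caller-supplied functions, and the candidates are compared by the description length of the whole trial rule set rather than by the exact computation of FIG. 10.

```python
def optimize_rules(rules, data, make_replacement, make_revision,
                   ruleset_dlen, covers):
    """For each rule, keep whichever variant gives the shortest rule set."""
    rules = list(rules)
    remaining = list(data)
    for i in range(len(rules)):                       # loop 712
        old_rule = rules[i]                           # saved at 710
        candidates = [
            old_rule,
            make_replacement(remaining),              # new rule (713)
            make_revision(old_rule, remaining),       # modified rule (715)
        ]

        def dlen_with(candidate):
            trial = rules[:i] + [candidate] + rules[i + 1:]
            return ruleset_dlen(trial, data)

        # Shortest description length wins (717); ties keep the old rule
        # because min() returns the first minimal candidate.
        rules[i] = min(candidates, key=dlen_with)
        # Examples covered by the chosen rule are removed (721).
        remaining = [ex for ex in remaining if not covers(rules[i], ex)]
    return rules
```

Listing the original rule first means it is retained whenever neither variant strictly shortens the description length, which is a design choice of this sketch.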

The function used to compute the description length is
relative_compression, shown in detail in FIG. 10 at 1009.
The function 1009 first produces a copy of the rule set with
the chosen rule and prunes the rule set using reduce_dlen
(1011); then the function 1009 does the same with a copy of
the rule set without the chosen rule (1013); then the function
1009 computes the description length of the exceptions for
each of the pruned rule sets (1015), and finally the function
1009 returns the difference between the description length
for the exceptions for the rule set without the rule and the
sum of the description length for the exceptions for the rule
set with the rule plus the description length of the rule
(1017). The computation of the description lengths is done
using data_dlen as already described above.
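
The computation described for relative_compression can be sketched in Python as below. The argument list is an assumption made for illustration: reduce_dlen, data_dlen, and rule_dlen are passed in as functions, and the rule set is assumed to be a list that already contains the chosen rule.

```python
def relative_compression(rules, rule, data, reduce_dlen, data_dlen, rule_dlen):
    """Bits saved by keeping `rule` in the rule set (larger is better)."""
    # Copy of the rule set with the chosen rule, pruned (1011).
    with_rule = reduce_dlen(list(rules), data)
    # Copy of the rule set without the chosen rule, pruned (1013).
    without_rule = reduce_dlen([r for r in rules if r is not rule], data)
    # Description length of the exceptions for each pruned set (1015).
    dl_with = data_dlen(with_rule, data)
    dl_without = data_dlen(without_rule, data)
    # Exceptions without the rule, minus exceptions with the rule
    # plus the cost of the rule itself (1017).
    return dl_without - (dl_with + rule_dlen(rule))
```

A positive result means the rule pays for its own description length by reducing the exceptions that must be encoded.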

Conclusion 

The foregoing Detailed Description has disclosed to those 
skilled in the art the best mode presently known to the 
inventor of practicing his techniques for inducing rule sets 
for classifiers from example data sets. The techniques dis- 
closed herein produce rule sets which are as accurate as 
those produced by systems such as C4.5, but the production 
of the rule sets requires far fewer computational resources. 
Resources are saved by producing a rule set which has "just 
enough" rules; accuracy is obtained by the stopping condi- 
tions used to terminate rule pruning and rule set growth and 
by optimization techniques which optimize the rule set with
regard to the rule set as a whole. Iteration increases the 
effectiveness of the optimization techniques. A particular 
advantage of the techniques disclosed herein is their use of 
description length to determine the stopping condition and to 
optimize the rule set.

As will be immediately apparent to those skilled in the art,
many embodiments of the techniques other than those 
disclosed herein are possible. For example, the preferred 
embodiment uses an improvement of IREP to produce the 
rule set; however, any other technique may be used which 
similarly produces "just enough" rules. Further, the preferred
embodiment uses description length to optimize with
regard to the entire rule set; however, other optimization 
techniques which optimize with regard to the entire rule set 
may be used as well. Moreover, optimization techniques
other than the pruning and modification techniques disclosed 
herein may be employed. Finally, those skilled in the art are
easily capable of producing implementations of the principles
of the invention other than the implementation disclosed in
the pseudo-code.

All of the above being the case, the foregoing Detailed 
Description is to be understood as being in every respect 
illustrative and exemplary, but not restrictive, and the scope 
of the invention disclosed herein is not to be determined 
from the Detailed Description, but rather from the claims as 
interpreted according to the full breadth permitted by the 
law. 

What is claimed is: 

1. A method practiced in a computer system which 
includes a processor and a memory system of inducing sets 
of classification logic rules for classifying data items from 
an example dataset of the data items, the sets of classifica- 
tion logic rules and the example dataset being stored in the 
memory system and the method comprising the steps per- 
formed in the processor of: 

inducing a first rule set from the example dataset accord- 
ing to a predetermined method, the first rule set being 
substantially smaller than a largest rule set producible 
by the predetermined method, and storing the first rule 
set in the memory system; and 

optimizing the first rule set with regard to the largest rule 
set to produce a second rule set.

2. The method set forth in claim 1, further comprising the
steps of:

after producing the second rule set, producing a third rule
set by adding rules to the second rule set to cover data
[...]

3. The method set forth in claim 2, wherein the method is
iterated n times and the second rule set is the new second
rule set produced in the nth iteration.

4. [...] first rule set or the third rule set and using the
description length in the optimization.

5. The method set forth in claim 4, wherein the step of
optimizing the first rule set or the third rule set includes the
[...]

6. [...] is pruned to maximize a function

(p - n) / (p + n)

where p is a number of positive examples for the rule in the
example dataset and n is a number of negative examples for
[...]

7. The method set forth in claim 6, wherein the step of
pruning the first rule set is done by deleting rules from the
first rule set such that the description length of the first rule
set is reduced.

8. The method set forth in claim 5, wherein the step of
pruning the first rule set is done by deleting rules from the
first rule set such that the description length of the first rule
set is reduced.

9. [...] set comprises the steps performed for each rule in
the first or the third rule set of:

making a modification of the rule and pruning the
modification to minimize an error [...] of the first or the
third rule set; and

[...]

10. [...] steps of:

making a first modification independently of the rule; and

making a second modification by adding conditions to the
rule; and

the step of determining determines whether to replace the
rule with the first modification or the second modification.

11. The method set forth in claim 9, wherein the step of
optimizing further comprises pruning the first or the third
rule set by deleting rules from the first or the third rule set
such that the description length of the [...] or the third rule
set is reduced.

12. [...] stopping condition occurs.

13. The method set forth in claim 12, wherein the step of
inducing the first rule set includes the step of checking a
description length of the first rule set to determine whether
[...]

14. [...] description length of the first rule set is
performed repeatedly and includes the step of comparing a
[...] value of the description length of the first rule set
[...] description length thus far obtained to determine
whether the stopping condition has occurred.

15. [...] comparing the description length determines that
the stopping condition has occurred when the current value of
the description length of the first rule set is more than a
predetermined value larger than the shortest description
length.

16. [...] in a computer system which [...] of inducing a set
of classification logic rules for classifying data items from
an example dataset of the data items, the rules and the
example dataset being stored in the memory system and the
method comprising the steps performed in the processor for
each rule of:

inducing the rule on the example dataset,

adding the rule to the set of classification logic rules;

[...] logic rules with the added rule; and

terminating the method if the description length satisfies
a predetermined condition.

17. [...] of claim 16, wherein the predetermined condition is
the description length which is a predetermined amount larger
than a smallest previously computed description length.

18. The method of claim 16, further comprising the step
performed for each rule of pruning the set of classification
logic rules to maximize a function

(p - n) / (p + n)

where p is a number of positive examples for the rule in the
example dataset and n is a number of negative examples for
the rule.

* * * * *
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